Maximum likelihood phylogeographic inference of cell motility and cell division from spatial lineage tracing data

Abstract Motivation Recently developed spatial lineage tracing technologies induce somatic mutations at specific genomic loci in a population of growing cells and then measure these mutations in the sampled cells along with the physical locations of the cells. These technologies enable high-throughput studies of developmental processes over space and time. However, these applications rely on accurate reconstruction of a spatial cell lineage tree describing both past cell divisions and cell locations. Spatial lineage trees are related to phylogeographic models that have been well-studied in the phylogenetics literature. We demonstrate that standard phylogeographic models based on Brownian motion are inadequate to describe the spatial symmetric displacement (SD) of cells during cell division. Results We introduce a new model—the SD model for cell motility that includes symmetric displacements of daughter cells from the parental cell followed by independent diffusion of daughter cells. We show that this model more accurately describes the locations of cells in a real spatial lineage tracing of mouse embryonic stem cells. Combining the spatial SD model with an evolutionary model of DNA mutations, we obtain a phylogeographic model for spatial lineage tracing. Using this model, we devise a maximum likelihood framework—MOLLUSC (Maximum Likelihood Estimation Of Lineage and Location Using Single-Cell Spatial Lineage tracing Data)—to co-estimate time-resolved branch lengths, spatial diffusion rate, and mutation rate. On both simulated and real data, we show that MOLLUSC accurately estimates all parameters. In contrast, the Brownian motion model overestimates spatial diffusion rate in all test cases. In addition, the inclusion of spatial information improves accuracy of branch length estimation compared to sequence data alone. On real data, we show that spatial information has more signal than sequence data for branch length estimation, suggesting augmenting lineage tracing technologies with spatial information is useful to overcome the limitations of genome-editing in developmental systems. Availability and Implementation The python implementation of MOLLUSC is available at https://github.com/raphael-group/MOLLUSC.


Introduction
Development of multicellular organisms is a spatiotemporal process involving growth, death, differentiation, and movement of cells.Lineage tracing, or inferring the complete history of cell divisions during the development of an organism or tissue, has been a key goal of developmental biology.In most organisms-including humans-the rate of naturally occurring somatic mutations is too low to provide sufficient phylogenetic signal to reconstruct a phylogenetic tree from single cells (Chen et al., 2022).There has been tremendous interest in dynamic lineage tracing technologies, which use genome-editing technologies such as CRISPR/Cas9 to induce heritable mutations at pre-defined locations in the genome which are then measured via single-cell sequencing of the cells (McKenna et al., 2016, Kalhor et al., 2018, Raj et al., 2018, Spanjaard et al., 2018, Wagner et al., 2018, Chan et al., 2019, Bowling et al., 2020).Multiple specialized computational methods have been developed to infer cell lineage trees from dynamic lineage tracing data (Jones et al., 2020, Feng et al., 2021, Chen et al., 2022, Gong et al., 2022, Seidel and Stadler, 2022, Sashittal et al., 2023, Mai et al., 2024).Some methods also use expression data in addition to the CRISPRinduced mutations (Zafar et al., 2020, Pan et al., 2023).Recently, spatial lineage tracing technologies have emerged that enable recording of spatial location of the cells in addition to their induced mutations (Chow et al., 2021, He et al., 2022, Chadly et al., 2024).Each of these published lineage tracing technologies offers unique combinations of spatial recording and lineage tracking technologies.This raises the question of how to infer a spatial cell lineage tree, a tree that records the history of both cell divisions and cell movements through space, from this data.
The inference of a spatial lineage tree is related to questions studied in phylogeography over the past several decades (Lemey et al., 2009, 2010, Nielsen and Beaumont, 2009, Bloomquist et al., 2010, Kalkauskas et al., 2021).A phylogeographic model describes both the evolutionary history and migration history of the species from the observed data of the extant species.In addition, from the inferred phylogeny and spatial distributions of the species, phylogeography studies can answer questions related to the interaction between genetic evolution and spatial migration.
Existing phylogeographic models, however, rely on assumptions that do not always hold in spatial lineage tracing data.For example, nearly all existing phylogeographic models are based on reversible Markov processes that allow for efficient computation (Lemmon andLemmon, 2008, Lemey et al., 2009).This is a reasonable assumption when there is a notable separation time between parent and child nodes (e.g.hundreds to thousands of generations), and thus directional biases in movement during one generation can be ignored.In contrast, the time scale in spatial lineage tracing is much shorter.
We find that existing phylogeographic models ignore a key property in real spatial lineage tracing data, namely that when a cell divides there is a symmetric displacement of daughter cells relative to the parent.We find strong evidence for such symmetric displacement in the intMEMOIR spatial lineage tracing data of mouse embryonic stem cells (Chow et al., 2021).
We propose the symmetric displacement (SD) model to describe the spatial location of daughter cells relative to the parent cell during cell division.To the best of our knowledge, this is the first attempt to model the displacement of cells during cell division in the context of lineage tracing.Combining the SD spatial model with the probabilistic mixed-type missing (PMM) model (Mai et al., 2024) for sequence data, we obtain a phylogeographic model for spatial lineage tracing.We derive a maximum likelihood (ML) method, MOLLUSC (Maximum Likelihood Estimation Of Lineage and Location Using Single-Cell Spatial Lineage tracing Data), that infers a spatial lineage tree from spatial lineage tracing data.The inferred spatial lineage tree contains both lineage and spatial information, including the time-resolved branch lengths, spatial diffusion rate, and sequence mutation rate.We show on simulated data that MOLLUSC has higher accuracy in estimation of branch lengths and mutation rate compared to solely using sequence data and accurately estimates the spatial diffusion rate.Applying MOLLUSC to intMEMOIR-a spatial lineage tracing dataset of mouse embryonic stem cells-we detect a clear correlation between cell radius and division displacement.

Spatial lineage tracing data and representation of cell lineage tree
Spatial lineage tracing data consists of two modalities: the observed CRISPR-induced sequences (i.e.character matrix)which we denote by S-and the spatial locations (i.e.(x, y) coordinates) of cells existing at the end of the experimentwhich we denote by L. Because these cells have been divided from a common ancestor (i.e. the progenitor cell), they form leaf nodes of a hidden phylogenetic tree of the cells, which we will refer to as the cell lineage tree.
The cell lineage tree is a rooted tree T ¼ ðV T ; E T Þ whose branch lengths measure time between consecutive cell divisions.We let L T denote the set of leaves of T. Let ðu; vÞ 2 E T be the edge from u to its child v (where u; v 2 V T ), and let r T denote the root of T. We will use "edge" and "branch" interchangeably.We assume that the root of T has exactly one child (the progenitor cell needs time to divide) and all other internal nodes of T have exactly two children (cells always divide into two).When the context is clear, the subscript T is omitted for brevity.Let δð�; �Þ denotes the distance of two nodes in T. For any edge ðu; vÞ 2 E T , we let δ v be the shorthand for δðu; vÞ.

Phylogeographic model
A typical phylogeographic model consists of two independent processes occurring on a lineage tree: (1) sequence evolution and (2) spatial diffusion.The joint likelihood of the sequence data S and cell locations L given T is the product of the two independent likelihoods: LðT; X; L; SÞ ¼ L L ðT; X L ; fδ v g; LÞL S ðT; X S ; fδ v g; SÞ; (1) where fδ v g denote the set of branch lengths of T, X S and X D denote the parameters of the spatial model and the sequence evolution model, respectively, and X ¼ ðΩ S ; Ω D ; fδ v gÞ is the set of all parameters of the phylogeographic model.

Brownian motion model for spatial diffusion
Consider a cell phylogeny T that has jL T j ¼ N leaf nodes.For a node v in V ðTÞ, let x v and y v denote the x and y coordinates of v.We assume that x v and y v are given (i.e.observed) for all leaf nodes v 2 L T and are hidden for all other nodes.Let x r T ¼ x 0 ; y r T ¼ y 0 ; L x ¼ fx u : u 2 L T g; L y ¼ fy u : u 2 L T g, and L ¼ ðL x ; L y Þ be the observed spatial data.Assuming the spatial diffusion on x and y coordinates are independent, for all ðu;vÞ 2 E T where u is the parent of v, the Brownian motion model (Lemmon and Lemmon, 2008) assumes that (2) where Nð0; σ 2 δ v Þ denotes a Gaussian distribution of mean 0 and variance σ 2 δ v , and σ is the diffusion rate.The model is illustrated in Fig. 1A.
A fundamental assumption of the Brownian motion model is the independence of the spatial location of the two daughter nodes given the location of the parent node.While this is a common assumption in a most research works in phylogeography, our analysis of the spatial lineage tracing data from the intMEMOIR technology (see section "Evidence of SD of cell division in intMEMOIR data" in Results) revealed that this assumption does not hold for the positions of dividing cells.Specifically, we detect a non-negligible displacements of the daughter cells from their parent right after cell division.Importantly, the displacements are symmetric, breaking the assumption about the independent locations of daughter cells given the parent.

SD model
We introduce the SD model to describe the spatial location of cells over a developmental process.The SD model is a composite of initial displacement by symmetric cell division followed by cell diffusion (movement by Brownian motion).Specifically, for a fixed cell radius r and a triplet (u, v, w) where u is the parent of v and w, we assume: where σ is the diffusion rate (i.e.σ 2 is the variance of the Gaussian model governing the Brownian motion), and ðr; θ u Þ are the polar coordinates of the daughter cells from the parent cell (Fig. 1B) reflecting SD at the division time.
The SD model generalizes existing models in the phylogeography and cell motility literature (Hall, 1977, P� erez and Prendergast, 2007, Codling et al., 2008, Jones et al., 2015, Wadkin et al., 2018).When r ¼ 0, the model reduces to the Brownian motion model, since fθ u g has no effect on spatial locations.Note that the SD model has one parameter θ u for each internal vertex and one parameter r for cell radius.This model has more parameters than the Brownian motion model and therefore, requires a larger amount of data for parameter estimation.The SD model also has some connection to the Arithmetic Brownian Motion (ABM) model (Royer-Carenzi and Didier, 2016), which assumes a continuous character evolves according to Brownian motion with some linear deterministic trend (i.e.given a successive time step of length s, the increment can be modeled as Nðμs; σ 2 sÞ for a linear constant trend μ).Our model differs as in our case we have a one time event (cell division) that does not scale with time between division times, as well as more importantly the symmetric nature of related daughter cells.

Likelihood computation under the SD model
Let X L ¼ ðσ; r; fθ u gÞ be the set of all parameters of the SD model.Let pðL x ; T; X L ; fδ v gÞ denote the likelihood of L x given T; Ω S and fδ v g under the SD model, and similarly define pðL y ; T; X L ; fδ v gÞ.Assuming the diffusion processes on the x and y coordinates are independent, we have: From Equation (3), we prove in Supplement that for all leaf nodes w and all pairs of leaf nodes v 6 ¼ w: (5) Here, Pathð�; �Þ denotes the path between two nodes, lcað�; �Þ denotes the least common ancestor of two nodes, and s u;v ¼ 1 if v is the left child of u and s u;v ¼ −1 otherwise.Therefore, the distribution of L x is multivariate normal, with mean l x and covariance R, where Rðw; wÞ ¼ σ 2 δðr T ; wÞ; for all leaf nodes w; Rðv; wÞ ¼ σ 2 δðr T ; lcaðv; wÞÞ; for all pairs of non À root nodes v 6 ¼ w: (8) Similarly for the y-coordinate, L y is multivariate normal with l y and covariance R, where l y is defined similarly.Thus, We show that Equations ( 9) and ( 10) can be computed efficiently by a generalization of the Felsenstein's algorithm for continuous traits (Felsenstein, 1973).Details are in Supplementary Section S4.

The evolutionary model of CRISPR-induced sequences
We model the evolution of the characters using the PMM model (Mai et al., 2024)  The PMM model is parameterized by the tree branch lengths fδ v g, mutation rate λ, heritable missing rate ν, and dropout rate ϕ.Let X S ¼ ðλ; ϕ; νÞ.We refer to the data obtained from the CRISPR-induced sequences as the observed character matrix, denoted by S, which is an N × K matrix where K is the number of target sites(following the convention in phylogenetics, we use target site and site interchangeably.)and N is the number of cells.Entries in column k of S take values in the set A ðkÞ ¼ f?; − 1; 0; 1; . . .; M ðkÞ g, where (a) A ðkÞ is the alphabet of target site k, (b) 0; −1; ?represent the unmutated state, silent state, and missing state, respectively, and (c) 1; . . .; M ðkÞ are mutated states.Under the PMM model, the likelihood of each site k of S is: where τ end is the length of the experiment (i.e.all cells at the leaves of the lineage tree were sampled at the same time, so they must have the same distance to the root in time unit).

Results
We provide evidence that the SD of cells can be observed in real data and that it has a noticeable effect on estimating the separation time between parent and daughter cells.Using this evidence as motivation, we benchmark MOLLUSC on (1) different simulation setups and (2) real data from intMEMOIR of dynamic lineage tracing sequences and cell locations, to illustrate the benefit of using the new spatial model on inferring time-resolved branch lengths and the spatial diffusion rate of the cells.

Evidence of SD of cell division in intMEMOIR data
intMEMOIR (Chow et al., 2021) is a recent experimental technology for single-cell spatial lineage tracing that involves a combination of inducible inheritable barcode edits with an imaging system for spatial resolution (refer Supplementary Section S1 for more details).One unique feature of this imaging technology is that ground truth lineage trees and ancestral locations are available.We analyze cell locations from consecutive time frames of the intMEMOIR data right before and after cell divisions.In particular, for every time frame t that contains a parent cell p that divides in the next time frame (t þ 1) into daughter cells c 1 and c 2 , we and compare the locations of the daughter cells c 1 and c 2 with the location of p. From our analysis, we observe that there are displacements of the cells because of division, which is distinct from the spatial diffusion of the cells before division that is usually modeled by Brownian motion in phylogeography.More importantly, we also show evidence for the SDs of the daughter cells from their parent.We first visualize the locations of each triplet of cells (i.e. the 2D coordinates of the parent cell p at time frame t and those of its daughters c 1 and c 2 at time frame t þ 1) (Fig. 2B).We observe that a majority of these triplets have the daughter cells symmetric from the parent cell.It is non-trivial, however, to either accept or reject the hypothesis of SD by division from such a discrete set of time frames.Note that the displacement is defined as the cell movement right after the moment of division.The discretized time frames, however, cannot fully capture this phenomenon, as we only know that cell division happened some time between the two time frames that are recorded, but cannot be certain about the exact moment of cell division.As such, there can be small diffusion of the daughter cells between the actual time of cell division and the time their locations were recorded, distorting the analysis on SD.
To test our hypothesis about the symmetry of cell displacements versus a sole Brownian motion displacement of the cells, we conducted a more in-depth analysis.We analyze the angles formed from the daughter cells and the parent cell.We name such an angle the daughter-parent-daughter angles, and denote it by αðτÞ, where τ is the time between the recording times of the parent cells and the two daughters.The hypothesis about SD implies that when τ !0, the angle approaches 180 � .However, because the resolution of recording only allows a minimum τ ¼ 1, there is a distribution of α among the collected triplets.Nevertheless, from Fig. 2B which shows α at τ ¼ 1, we can observe that α is usually large (close to 180 � ).
In contrast, if there is no displacement by division and cells only move according to Brownian motion, then the daughter cells can form any angle around the parent cell with the same probability, so α is uniformly distributed in ½0 � − 180 � �.We show that the empirical distribution of α at τ ¼ 1 is very different from that of the Brownian motion model, with a much higher concentration at the large degree closer to 180 � (Fig. 2C, where data are collected for all slides of intMEMOIR).This fact allows us to reject the Brownian motion model at τ ¼ 1 and supports the hypothesis of SD right after cell division.

Phylogeographic model for spatial lineage tracing i231
We further study cell movement after division at longer time intervals.By analyzing the distribution of αðτÞ at τ > 1, we test for the following two hypotheses: (1) the daughter cells continue moving in the direction defined by the initial displacement from their parent, and (2) each daughter cell diffuses in space independently of the initial direction of the displacement from their parent.Note that if (1) holds, then the daughter cells always maintain symmetric positions around the parent cell, so αðτÞ should be large.In contrast, if (2) holds, the angle can change to any value after division time and the distribution approaches a uniform distribution when τ ! 1.By constructing the empirical distributions of α at different τ (Fig. 2D), we observe that the distribution of α gradually converges to the uniform distribution with increasing τ, and at τ ¼ 50, which is the maximum observed branch length of the lineage trees in the intMEMOIR data, we see that the distribution of α is very close to the uniform distribution possessing by the Brownian motion model.This result supports the application of the Brownian motion model independently for each daughter cell's diffusion beyond their initial displacements from the parent cell.

Evaluation of MOLLUSC
We evaluate MOLLUSC on both simulated data and real intMEMOIR data.

Evaluation on simulated data
Using the proposed phylogeographic model, we simulated both sequence and location data of the cells.Both sequences and cell locations were simulated from the real lineage tree topologies (T) and real branch lengths fδ v g of intMEMOIR.
We filtered out the lineage trees that have less than 10 leaves, leaving us with 70 model trees used for simulation.For each of these model trees, we simulated spatial data following the SD model and sequence data following the PMM model.The parameters X L of the SD model and X S of the PMM model were selected to match statistics of multiple data modules of intMEMOIR: the frame-by-frame data, the imaging data, and the sequence data (see Supplementary Section S2 for more details).We set the parameters in the simulation as follows: diffusion rate σ ¼ 1:5, cell radius r ¼ 6.68, number of target sites K ¼ 10, alphabet size A ðkÞ ¼ f0; 1; 2g for every site k, and mutation rate λ ¼ 0:006.In addition, we simulated two other sets of spatial data that have σ ¼ 0:5 and σ ¼ 3 (other parameters were kept the same).These two additional datasets are useful in studying the effect of different diffusion rates on the performance of MOLLUSC.
We ran MOLLUSC on each of the following two scenarios: (i) fully simulated data: simulated sequence data and simulated location data were used as inputs to MOLLUSC, and (ii) semi-simulated data: simulated sequence data and real location data were used as inputs to MOLLUSC.

Fully simulated data
When running with simulated sequence and simulated location data, the estimate σ of MOLLUSC is unbiased on all model conditions that we tested (Fig. 3A).The variance of σ is smaller when both sequence and location data are given than when only location data is given, indicating the benefit of using both data modules for estimation.Interesting but not out of expectation, the variance of σ also increases with the true value of σ, indicating that when cell motility increases, there is more uncertainty in estimation of the diffusion rate.
Next, we study branch length estimation error (Fig. 3C).We measure branch length error by the mean absolute percentage error (MAPE), defined over the set of n branches fδ i g in a given tree and the estimates of those branch lengths f δi g as: If only location data is given, branch length error is smallest at σ ¼ 1:5, which is the diffusion rate that best matches the real intMEMOIR data.The error increases on both the lower (i.e.σ ¼ 0:5) and higher, that is, σ ¼ 3 values of σ.Interestingly, branch length error of Location Only (L) at σ ¼ 1:5 is very similar to that of Sequence Only (S), indicating that the sequence and location data have similar amount of phylogenetic signals (i.e. the average MAPE of Sequence Only is 0.758 and of Location Only at σ ¼ 1:5 is 0.813), and the difference is insignificant according to ANOVA test (P-value .256).Importantly, employing both sequence and location data significantly improves branch length estimation, where the average error reduces from � 0:758 in S to � 0:569 in S þ L. The difference between S þ L and S is significant, according to ANOVA test (P-value 2:8 × 10 − 6 ).Additionally, the error does not change much with σ if both sequence and location data are used.

Semi-simulated data
When using simulated sequence but real spatial data, we run MOLLUSC on varying values of r 2 f0; 1; 3; 5; 7; 10g and study the impact of r on the estimates of σ (Fig. 4A) and branch lengths (Fig. 4B).Recall that when r ¼ 0, our model reduces to the Brownian diffusion model, where cell displacement through division is ignored.Our results show that when r ¼ 0, σ is overestimated on all 70 lineage trees, and this overestimation happens regardless of whether the sequence data is used or not.Note that overestimation is expected, because when r ¼ 0, all the effect of SD through division is ignored and the spatial data is explained solely by Brownian motion, which in turn is parameterized solely by σ.In such a setting, σ is overestimated to account for both diffusion and displacement.When r > 0, the bias term reduces substantially and remains low ( ≤ 0.110) with all tested r 2 ½1; 5�.With larger values of r, however, σ is underestimated (i.e.σ is slightly underestimated when r ¼ 7 with bias -0.235 and more highly underestimated when r ¼ 10 with bias -0.386).
Examining the effect of r on branch length estimation (Fig. 4B), we detect an interesting U-shape structure where branch length error first decreases with r but later increases, hinting to an optimal value of r at between 5 and 7 that minimizes branch length error.As mentioned before, we hypothesize that r-which is the displacement magnitude in our model-is also the actual cell radius.To test this hypothesis, we use a provided image of the cells from intMEMOIR to manually detect the locations of cells and measure their radii (see Supplementary Fig. S1).From this study, we get the average radius of the cells to be � 6:68 and show it in Fig. 4B using the solid red line.This estimated radius is very close to the projected optimal value of r that minimizes branch length error, supporting our hypothesis about the connection between displacement magnitude and cell radius.
Next, we test different modalities of MOLLUSC on this dataset, including Sequence Only (S), Location Only (L), and Sequence þ Location (S þ L) (Fig. 4C).We observe that the error of estimated σ and branch lengths are both minimized at r ¼ 5. Therefore, we use r ¼ 5 as the default value of r for this data.We observe the same trend as in the fully simulated data: the branch length estimation error of Sequence þ Location (S þ L) at r ¼ 5 (average MAPE ¼ 0.678) is lower than that of Sequence Only (S) (average MAPE ¼ 0.797); the difference is significant according to ANOVA test with P-value 6:6 × 10 − 4 .Importantly, we do not gain improvement on branch length error if SD is ignored (i.e. S þ L at r ¼ 0), pointing to the role of SD in the model of cell motility.We also note, however, that when we use Location Only, having non-zero r sometimes has a negative impact on branch length estimation, possibly due to overfitting (i.e. as mentioned before, the model has many more parameters if r 6 ¼ 0 and requires more data for parameter estimation).Nevertheless, the difference between Location Only with r ¼ 0 and r ¼ 5 is insignificant (according to the ANOVA test, P-value .401).

Evaluation on the real intMEMOIR lineage tracing data
We evaluate MOLLUSC on the intMEMOIR lineage tracing data using both the measured mutations (sequence data) and measured cell positions (location data) at the final time.Note that unlike the other settings where the sequence data was simulated under the correct model that we assumed, in this case there can be model misspecification in Sequence data that amplifies the error of parameter estimation.
Results on estimation of σ and branch lengths (Fig. 5A and  B) are consistent with semi-simulated data.That said, if one ignores displacement through division (i.e.r ¼ 0), σ will be overestimated regardless of whether the sequence data is used or not (Fig. 5A).There is a good range of r around 1 to 5 where the estimate σ is unbiased, but increasing r beyond that leads to underestimation of σ.The result on the impact of r on branch length error is also consistent with the semisimulated data.We also detect a U-shape structure in the plot (Fig. 5B), and the optimal value of r that minimizes branch length error is still between 5 and 7, very close to the average cell radius of 6.68 estimated from imaging data.On this data, we also examine the accuracy of estimating the mutation rate λ, and find that the estimate is always accurate with or without spatial data (see Supplementary Section S5).
As before, we also test different modalities of MOLLUSC on this dataset, including Sequence Only (S), Location Only (L), and Sequence þ Location (S þ L) (Fig. 5C).There is a subtle difference in branch length error between Sequence Only and Sequence þ Location with r ¼ 0, but adjusting r to 5 gives a more pronounced improvement, especially on outliers.A unique feature of this real data, however, is the relative performance of Location Only (L) compared to the other two settings where sequence data is incorporated.Interestingly, we observe that solely using location data gives the best branch length estimation.This result shows that not only the location data has more signal for branch length estimation than sequence data, but also the incorporation of sequence data actually has a negative impact.While this fact motivates the usage of location data for lineage tree inference, at the same time it raises concerns about model misspecification in sequence data that can severely degrade the accuracy of lineage inference.

Discussion
Spatial lineage tracing technologies hold great promise for studying the temporal and spatial components of developmental processes, but are hindered by a lack of reliable computational methods and relatively poor sequence data quality compared to traditional lineage tracing.Here, we propose a novel phylogeographic model-the SD model-for spatial   lineage tracing data where we model cell motility as a composite of SD of daughter cells from their parent after cell division, followed by Brownian motion diffusion of individual cells.Combining the SD model with the PMM sequence model described in (Mai et al., 2024), we obtain the first phylogeographic model for spatial lineage tracing.In addition, we develop MOLLUSC, a ML inference framework designed specifically for spatial lineage tracing, where one can combine any spatial model with a sequence model to create a phylogeographic model and perform ML inference.
On both simulated and real spatial lineage tracing data, we demonstrate that the joint model of cell sequence and location gives superior branch length estimation than sequence data alone.These contributions are a first step towards understanding the nuanced interplay between how cellular function is informed by spatial context and ancestry, as well as how these two factors influence each other.The combination of new spatial lineage tracing technologies and novel phylogeographic models will enable deeper insights into the spatiotemporal processes of organismal development.
There are several possible directions for future work.First, the ability to automatically estimate the cell radius r instead of requiring it as an input from users is desired.We attempted to jointly estimate r with other parameters of the model in the MOLLUSC framework, but found a systematic bias when r and σ are co-estimated.Specifically, r tends to be overestimated while σ is underestimated (see Supplementary Section S6), indicating that a more careful treatment is necessary.One approach could be to allow the user to specify a prior distribution on r and find the maximum a posterior (MAP) estimate.Our results show there is a wide range of r (between 1 and 10) that the SD model has higher accuracy compared to the Brownian-only model, thus we speculate that a uniform prior on r to constrain r to be within a reasonable range would be sufficient for accurate inference.Second, one should try extending the model to other spatial contexts.The intMEMOIR data included cells growing on a dish and thus a Brownian motion was appropriate.However, real tissues and organisms may have other growth patterns that require more specialized spatial models.The literature on cell motility is rich with studies on the effect of chemotaxis (Horwitz and Webb, 2003, SenGupta et al., 2021, Roca-Cusachs et al., 2013), cell adhesion (Huttenlocher et al., 1995), collisions (DiMilla et al., 1991, Vicente-Manzanares et al., 2005), and collective behavior (Friedl and Gilmour, 2009;Rørth, 2009;Weijer, 2009;George et al., 2017), that would be interesting to consider.
Another direction is to develop a better model for sequence data.For example, one could model sequencing error-a missing ingredient in the PMM model (Mai et al., 2024).In addition, the assumptions about a constant mutation rate both through time (i.e.molecular clock) and across target sites (i.e.homogeneous sites) could be relaxed for more accurate estimation.While one can borrow from existing models and methods in phylogenetics literature to relax the clock (Drummond et al., 2006, Ho and Duchêne, 2014, Volz and Frost, 2017, Mai and Mirarab, 2021, Mai et al., 2022) and allow rate heterogeneity (Gu et al., 1995, Mayrose et al., 2005, Stamatakis, 2006), we note that applying these more sophisticated models to the sequence data of spatial lineage tracing that has small number of sites and mutated states can lead to overfitting, though we believe that spatial lineage tracing technologies with higher quality will emerge within the coming years.
Finally, we focus on the branch length and spatial diffusion parameter estimation in this study, however, our proposed ML inference framework can be extended to infer tree topology and locations of ancestral cells.While it is known that ML inference of tree topology is NP-hard (Roch, 2006), one can employ one of the popular heuristic approaches implemented in popular software packages (Minh et al., 2005;Guindon et al., 2010;Price et al., 2010;Stamatakis, 2014) for topology search.ML inference of ancestral locations under the Brownian motion model has been studied before (Lemmon and Lemmon, 2008).We believe such an approach can be generalized to the SD model.
, which is a continuous-time Markov model specifically designed for CRISPR-induced sequences.The PMM model captures the irreversibility and nonmodifiability of the CRISPR-induced mutations, as well as enables the inference of time-resolved branch lengths from sequence data.Below we give a summary of the PMM model.Refer toMai et al. (2024)  for more details.

Figure 1 .
Figure 1.Overview of the symmetric displacement (SD) model and MOLLUSC.(A) The traditional Brownian motion phylogeographic model, where an organism moves according random drift.(B) The center position of two daughter cells (light gray) are placed symmetrically at distance r from the center position of the parent cell following cell division.(C) The symmetric displacement (SD) model for spatial lineage tracing data.The spatial locations of cells are a superposition of the initial symmetric displacement of daughter cells from the parent cell followed by Brownian motion.(D) MOLLUSC computes a maximum likelihood phylogeography for spatial lineage tracing data using a joint likelihood of the sequences and the 2D spatial locations.

Figure 2 .
Figure 2. (A) The lineage tree with cell locations of the sample s10c1 of intMEMOIR.Although cell locations are given for all time frames, sequences are only recorded at the last time frame (i.e.Frame 216).In this work, we only use cell locations at the last time frame where sequence data is available.(B) Empirical locations of the parent cell (shown in black) and the two daughter cells (shown in blue and red) at the division frames of the intMEMOIR data (combined of all cells of slide 10).Each gray box indicates a triplet of one parent cell and its two daughters, added for visual purpose.A majority of daughter-parent-daughter angles (α) is at high degree (close to 180 � ) in agreement with the symmetric displacement (SD) model.See Supplementary Fig. S6 for visualization of some other slides of the intMEMOIR data.(C) Empirical distribution of the daughter-parent-daughter angles (α) at one time frame after cell division versus the theoretical distribution under the Brownian motion model.The empirical distribution is concentrated at a high value (close to 180 � ) indicating a symmetric displacement of the two daughter cells from their parent, while the Brownian motion model yields a uniform distribution of the angles.(D) Empirical distributions of the daughter-parent-daughter angles (α) after different number of time frames since the division frame.As the time from cell division increases the distribution of the angles approaches the uniform distribution consistent with the Brownian motion model.

Figure 3 .
Figure 3. Evaluation on spatial diffusion and branch length estimation error on simulated sequence and location data.Each point represents the result on one tree (70 trees total).(A) Difference σ−σ between the estimated spatial diffusion rate σ and the true rate σ using both sequence and location data, and using only location data.(B) Branch length error measured as the mean absolute percentage error (MAPE) for each tree with simulated sequences and simulated location data.

Figure 4 .
Figure 4. Estimation errors of spatial diffusion rate and branch lengths on the semi-simulated data.Each point is the result on one tree (70 tree topologies, with five simulated sequence replicates for each topology, giving 350 samples).(A) Estimation of σ−σ given different models (Sequence Only (S), Sequence þ Location (S þ L), and Location Only (L)).(B) Branch length estimation error, measured as the mean absolute percentage error (MAPE) versus r when both sequence and location data are used.The area shown is the space between the 95% confidence intervals, the middle line is the mean.The red vertical line indicates the cell radius estimated from frame-to-frame data of 6.68.(C) Branch length estimation error (measured as MAPE) under different models (S, S þ L, and L).

Figure 5 .
Figure 5. Estimation errors of spatial diffusion rate and branch lengths on the semi-simulated data.Each point is the result on one tree (106 trees).(A) Estimation of σ−σ given different models (Sequence Only (S), Sequence þ Location (S þ L), and Location Only (L)).(B) Branch length estimation error, measured as the mean absolute percentage error (MAPE) versus r when both sequence and location data are used.The area shown is the space between the 95% confidence intervals.(C) Branch length estimation error (measured as MAPE) under different models (S, S þ L, and L).
fδ v g; S ðkÞ Þ ¼ PðS ðkÞ ; T; X S ; fδ v gÞ where x denotes a realization of the ancestral sequences, E T and L T denote the edge and leaf sets of T, respectively, W and U are the transition probability matrix and the dropout matrix, respectively, as defined inMai et al. (2024).The sequence likelihood, L S ðT; X S ; fδ v g; SÞ, is the product of the likelihoods of the individual sites:L S ðT; X S ; fδ v g; SÞ ¼ Y K k¼1PðS ðkÞ ; T; X S ; fδ v gÞ:(12) 2.7 MOLLUSC: a maximum likelihood method to infer spatial cell lineage treeRecall that the joint likelihood LðT; X; L; SÞ of L and S is simply the product of L L ðT; X L ; fδ v g; LÞ and L S ðT; X S ; fδ v g; SÞ (Equation (1)).We infer the maximum likelihood tree from L and S by solving the following constrained optimiza-